Exact Distribution of a Spaced Seed Statistic for DNA Homology Detection
Abstract
Let a seed, S, be a string from the alphabet {1, ∗}, of arbitrary length k, which starts and ends with a 1. For example, S = 11 ∗ 1. S occurs in a binary string T at position h if the length k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an iid Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as CnSp, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
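The definitions above can be checked on small inputs with a brute-force sketch. This is not the efficient method the abstract refers to: it enumerates all 2^n binary strings, so it is only practical for short n. The function names `covered_count` and `covered_distribution` are hypothetical, and the seed is written with an ASCII `*` for the wildcard (any character other than `1` is treated as a wildcard).

```python
from fractions import Fraction
from itertools import product


def covered_count(seed: str, text: str) -> int:
    """Number of 1s in `text` covered by at least one occurrence of `seed`."""
    k = len(seed)
    # Offsets inside the seed that must align with a 1 in the text.
    ones = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for start in range(len(text) - k + 1):
        if all(text[start + i] == "1" for i in ones):
            covered.update(start + i for i in ones)
    return len(covered)


def covered_distribution(seed: str, n: int, p: Fraction) -> list:
    """Exact distribution of the covered-1s count by exhaustive enumeration.

    Returns a list `dist` where dist[c] is the probability that exactly c
    ones are covered in an i.i.d. Bernoulli(p) string of length n.
    Assumption: brute force over all 2**n strings, for illustration only.
    """
    dist = [Fraction(0)] * (n + 1)
    for bits in product("01", repeat=n):
        text = "".join(bits)
        m = text.count("1")
        prob = p**m * (1 - p) ** (n - m)
        dist[covered_count(seed, text)] += prob
    return dist


if __name__ == "__main__":
    dist = covered_distribution("11*1", 6, Fraction(1, 2))
    for c, prob in enumerate(dist):
        print(c, prob)
```

Using `Fraction` keeps the probabilities exact, which makes the output convenient as a reference when validating a faster algorithm on small cases.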
Related resources
Exact Distribution of a Spaced Seed Statistic for Applications in DNA Repeat Detection
Let a seed, S, be a string from the alphabet {1, ∗} which starts and ends with a 1. For example S = 11 ∗ 1. S occurs in a binary string B at position k if S can be positioned so that the last letter in S aligns with the kth letter in B, and each 1 in S aligns with a 1 in B. A 1 in B is covered by S if there exists some occurrence of S in B such that the 1 in B aligns with a 1 in the occurrence ...
Amino Acid Classification and Hash Seeds for Homology Search
Spaced seeds have been extensively studied in the homology search field. A spaced seed can be regarded as a very special type of hash function on k-mers, where two k-mers have the same hash value if and only if they are identical at the w (w < k) positions designated by the seed. Spaced seeds substantially increased the homology search sensitivity. It is then a natural question to ask whether t...
Optimal Spaced Seeds for Homologous Coding Regions
Optimal spaced seeds were developed as a method to increase sensitivity of local alignment programs similar to BLASTN. Such seeds have been used before in the program PatternHunter, and have given improved sensitivity and running time relative to BLASTN in genome-genome comparison. We study the problem of computing optimal spaced seeds for detecting homologous coding regions in unannotated geno...
Indel seeds for homology search
We are interested in detecting homologous genomic DNA sequences with the goal of locating approximate inverted, interspersed, and tandem repeats. Standard search techniques start by detecting small matching parts, called seeds, between a query sequence and database sequences. Contiguous seed models have existed for many years. Recently, spaced seeds were shown to be more sensitive than contiguo...
Multiple spaced seeds for homology search
MOTIVATION: Homology search finds similar segments between two biological sequences, such as DNA or protein sequences. The introduction of optimal spaced seeds in PatternHunter has increased both the sensitivity and the speed of homology search, and it has been adopted by many alignment programs such as BLAST. With the further improvement provided by multiple spaced seeds in PatternHunterII, Smi...